Abstract

Fisher's p-value is a mainstay in statistics but, despite its ubiquity, it is often
misinterpreted as a sort of Bayesian posterior probability for the null hypothesis
or as a frequentist error probability. In this paper, we propose a new interpretation
of the p-value as a meaningful plausibility, where the latter is to be interpreted
formally within the inferential model (IM) framework. In particular, we show that,
for almost any practical hypothesis testing problem, there exists an IM such that
the corresponding plausibility function, evaluated at the null hypothesis, is exactly
the p-value for that test. The advantage of this representation is two-fold: first,
the notion of plausibility is consistent with the way practitioners use and interpret
p-values; second, the plausibility calculation avoids the logical difficulty of conditioning
on the truthfulness of the null. This connection with plausibilities also
reveals a shortcoming of standard p-values in problems with non-trivial parameter
constraints. Illustrations of valid and optimal inference in the classical binomial
model are given in this new context.

Keywords and phrases: Fisher; hypothesis test; inferential model; nesting; optimal
binomial inference; plausibility function; predictive random set.



1 Introduction 



Fisher's p-values are ubiquitous in applied statistics and, consequently, are covered in
probably all modern textbooks on basic statistical methods. But despite this popularity,
there is widespread misinterpretation of p-values as either a sort of Bayesian posterior
probability that the null hypothesis is true, or as a frequentist error probability. See,
e.g., Berger and Sellke (1987) and Sellke et al. (2001). Indeed, around the time we began
writing this paper, there was a popular media buzz about the apparent discovery of the
elusive Higgs boson particle (Overbye 2012) and statistics blogs were ablaze with discussions
about how some journalists, and apparently some physicists, had misinterpreted
the resulting p-values. But our objective here is not to drag out this stale discussion; the
ideas presented herein are new and different.

Arguably, the primary reason for the frequent misinterpretation of p-values is that the
standard textbook definition is inconsistent with people's common sense. The goal of this
paper is to put a different spin on Fisher's p-value, one that admits a more user-friendly
interpretation. We show that the p-value can be interpreted as a plausibility that the null
hypothesis is true. This "plausibility" is precisely defined within the inferential model
(IM) framework, proposed recently in Martin and Liu (2012); see, also, Martin et al.
(2010) and Zhang and Liu (2011). Specifically, consider the problem of testing a null
hypothesis $H_0$ versus a global alternative $H_1$. We show that, under mild conditions,
for any p-value (depending on $H_0$ and the choice of test statistic), there exists a valid IM
such that the plausibility of $H_0$ exactly equals the p-value. So, in this sense, Fisher's
p-value can be understood as the plausibility, given the observed data, that $H_0$ is true.
In the Higgs boson report, since the p-value is minuscule ($p \ll 10^{-6}$), one concludes
that the hypothesis $H_0$: "the Higgs boson does not exist" is highly implausible, hence,
a discovery. This line of reasoning based on small p-values is consistent with Cournot's
principle (e.g., Shafer and Vovk 2006).



The proposed interpretation has two attractive properties. First, the word "plausibility"
fits exactly with the way practitioners use and interpret p-values, i.e., a small
p-value means $H_0$ is implausible, given the observed data. Second, evaluating plausibility
involves a probability calculation (see Section 4), but this calculation does not require
one to assume that $H_0$ is true. Therefore, one avoids the questionable logic of trying to
prove $H_0$ is false by using a calculation that assumes it is true.

The remainder of the paper is organized as follows. Section 2 sets up our notation
and gives the formal definition of Fisher's p-value, along with a brief discussion of its
common correct and incorrect interpretations. There are Bayesian versions of the p-value
(Gelman et al. 1996; Rubin 1984) but we shall focus exclusively here on the classical version.
The basics of IMs are introduced in Section 3, in particular, predictive random sets
and plausibility functions. In Section 4 we prove the paper's main result (Theorem 2),
namely, that given essentially any hypothesis testing problem, there is a valid IM such
that the corresponding plausibility function, evaluated at the null hypothesis, is exactly
Fisher's p-value. Some remarks on the interpretation of this result are given there, as well
as an illustrative binomial distribution example. In particular, we highlight (i) that the
connection between p-values and IM plausibilities implies a similar connection between the
latter and objective Bayes posterior probabilities, and (ii) an important and apparently
unrecognized shortcoming of p-values in problems with non-trivial parameter constraints.
In Section 5 we reverse the result of Theorem 2 and investigate plausibilities from optimal
IMs as "new and improved" p-values. We focus on optimal IMs in the fundamental
problem of inference on a binomial success probability. A few concluding remarks are
given in Section 6.

2 Fisher's p-value



2.1 Setup and formal definition 

Let $X$ denote observable data, taking values in a sample space $\mathbb{X}$. There is a posited
sampling model $P_{X|\theta}$, indexed by a parameter $\theta \in \Theta$, and the goal is to determine which
distribution or, equivalently, which $\theta$ value is responsible for producing the observed data
$X = x$. Here both $X$ and $\theta$ are allowed to be vector-valued, but we will not make
this explicit in the notation. The hypothesis testing problem starts with a hypothesis,
or assertion, about the unknown $\theta$. Mathematically, this is characterized by a subset
$\Theta_0 \subset \Theta$, and we write $H_0: \theta \in \Theta_0$ for the null hypothesis and $H_1: \theta \notin \Theta_0$ for the
alternative hypothesis. The goal is to use observed data $X = x$ to determine, with some
measure of certainty, whether $H_0$ or $H_1$ is true.

At a very high level, Fisher's description of the p-value can be viewed as follows.
There are subsets $\mathbb{X}^*$ of $\mathbb{X}$ which have the force of logical disjunction. That is, if the
observed $X = x$ lands in $\mathbb{X}^*$, then one of two things occurred: relative to $H_0$, a rare
chance event has occurred, or $H_0$ is false. The unlikeliness of the former, a consequence of
the definition of $\mathbb{X}^*$, drives us to conclude the latter. To put this in more standard terms,
suppose there is a test statistic $T: \mathbb{X} \to \mathbb{R}$, possibly depending on $\Theta_0$, such that large
values of $T(X)$ suggest that $H_0$ may not be true. For example, if $X = (X_1, \ldots, X_n)$ is an
independent sample from a $\mathsf{N}(\theta, 1)$ population, and $\Theta_0 = (-\infty, \theta_0]$, then $T(X) = \bar X$ is a
reasonable choice: since $\bar X$ should be close to $\theta$, $\bar X$ being much larger than $\theta_0$ suggests
that $H_0$ may not be true. In this case, Fisher's p-value is defined, for $X = x$, as

$$ \mathrm{pval}(x) = \mathrm{pval}_{T,\Theta_0}(x) = \sup_{\theta \in \Theta_0} P_{X|\theta}\{T(X) \ge T(x)\}. \tag{1} $$

In the special but common case where $\Theta_0 = \{\theta_0\}$ is a singleton, i.e., a point null,
the p-value formula (1) simplifies to $\mathrm{pval}(x) = \mathrm{pval}_{T,\theta_0}(x) = P_{X|\theta_0}\{T(X) \ge T(x)\}$, which
is the expression found in most introductory textbooks.

Intuitively, $\mathrm{pval}(x)$ compares the observed $T(x)$ to the sampling distribution of $T(X)$
when $H_0$ is true. If $\mathrm{pval}(x)$ is small, then $T(x)$ is an outlier under $H_0$ and, by force of
logical disjunction, we conclude that $H_0$ is implausible, giving favor to the alternative
$H_1$. Conversely, if $\mathrm{pval}(x)$ is relatively large, then the observed $T(x)$ is consistent with at
least one $P_{X|\theta}$, with $\theta \in \Theta_0$, so $H_0$ is plausible in the sense that it provides an acceptable
explanation of reality.
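
As a rough numerical illustration of formula (1), the sketch below evaluates the one-sided p-value in the normal-mean example, with $\Theta_0 = (-\infty, \theta_0]$ and $T(X) = \bar X$. The function names, the finite grid used to approximate the supremum, and the numbers in the usage lines are illustrative assumptions, not part of the development above.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tail_prob(xbar, n, theta):
    """P_theta{ Xbar >= xbar } when Xbar ~ N(theta, 1/n)."""
    return 1.0 - normal_cdf(math.sqrt(n) * (xbar - theta))

def pval_one_sided(xbar, n, theta0, grid_size=2001, width=10.0):
    """Approximate the sup over theta <= theta0 of the tail probability.

    The supremum is taken over a finite grid on [theta0 - width, theta0];
    the grid is an illustrative device (hypothetical parameters).
    """
    step = width / (grid_size - 1)
    grid = [theta0 - k * step for k in range(grid_size)]
    return max(tail_prob(xbar, n, th) for th in grid)

# Usage (hypothetical numbers): the tail probability increases in theta,
# so the grid maximum is attained at the boundary point theta0.
p_grid = pval_one_sided(xbar=0.5, n=16, theta0=0.0)
p_closed = tail_prob(0.5, 16, 0.0)
```

In this example the grid search and the boundary value agree, which reflects the fact that the supremum in (1) is attained at $\theta_0$ for this one-sided problem.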

2.2 Standard interpretations 

The notion of a logical disjunction, described above, was the interpretation Fisher had in 
mind for p-value. Standard textbooks have adopted equivalent though (in our opinion) 
somewhat more convoluted interpretations. This is understandable, since the target 
audience for these books may not be familiar with language like "logical disjunction." 
The standard textbook interpretation of p-value goes something like this: 

$\mathrm{pval}(x)$ is the probability that an observable $X$ is "at least as extreme" as the
$x$ actually observed, assuming $H_0$ is true.

Such a definition, indeed, captures the essence of Fisher's force of logical disjunction.
But many users tend to forget the words between "probability" and "$H_0$ is true," leading
to the common misinterpretation of the p-value as a sort of Bayesian posterior probability of
$H_0$. However, the fact that the p-value conditions on the truthfulness of $H_0$ makes it easy to
see that no such connection can be made.



In advanced texts (e.g., Lehmann and Romano 2005, Sec. 3.3), after laying out the
details of the Neyman-Pearson testing program, p-values may be defined as:

$\mathrm{pval}(x)$ is the greatest lower bound on the set of all $\alpha$ such that the size-$\alpha$
test rejects $H_0$ based on $T(x)$.

While numerically equivalent, Fisher would surely disagree with defining his p-value in
terms of Neyman's significance testing procedures. One danger with this formulation is
that the conditioning on $H_0$ is hidden in the definition of size. Consequently, based on
this connection of the p-value with significance testing, users often misinterpret $\mathrm{pval}(x)$ as an
error probability, i.e., the probability of incorrectly rejecting $H_0$ based on $x$.

In light of these common misunderstandings, some statisticians have abandoned the
use of p-values, advocating for other tools for measuring evidence supporting $H_0$ and/or
testing $H_0$, such as confidence intervals and Bayes factors (Kass and Raftery 1995). Further
discussion can be found in Section 6, but the main point is that these alternatives
have their own shortcomings in terms of meaningful interpretations. For example, confidence
levels are often misunderstood as containment probabilities, so the problem has
not really been rectified. A new, user-friendly interpretation of the ubiquitous p-value is
arguably more valuable than a new tool.



3 Review of inferential models 
3.1 Big picture 

The inferential model (IM) framework produces prior-free posterior probabilistic measures
of evidence for/against any assertion about the unknown parameter; see Martin and Liu
(2012), Martin et al. (2010), and Zhang and Liu (2011). This is accomplished by first
making an explicit association between the observable data $X$, the unknown parameter
$\theta$, and an unobservable auxiliary variable $U$. The goal of this association is to
establish the property/idea that the best possible inference on $\theta$ is obtained if and only
if the value of $U$ is observed. As this condition can never be satisfied, the next best
thing would be to accurately predict this unobserved value. This prediction is done via
a random set, and the desired measures of evidence for the assertion about $\theta$ are obtained
via probability calculations with respect to the distribution of this random set.
Though this IM framework is new and different, one can nevertheless find some similarities
with several existing approaches, such as fiducial and its generalizations (Hannig
2009, 2012; Hannig and Lee 2009; Zabell 1992), confidence distributions (Xie and Singh
2012; Xie et al. 2011), Dempster-Shafer theory (Dempster 2008; Shafer 1976, 2011), and
Bayesian inference with default, reference, and/or data-dependent priors (Berger 2006;
Berger et al. 2009; Fraser 2011; Fraser et al. 2010; Ghosh 2011). This IM framework provides
prior-free probabilistic measures of evidence in data for/against any assertion about
the unknown parameter.

The IM framework was born from the authors' investigation of fiducial and Dempster-Shafer
theory. Indeed, both of these classical approaches introduce an auxiliary variable
into the inference problem. Both fiducial and Dempster-Shafer theory start by conditioning
on the observed $X = x$, and then develop a sort of distribution on the parameter
space by inverting the data-parameter-auxiliary variable relationship and assuming that
$U$ retains its a priori distribution after $X = x$ is observed. The IM approach differs
in that it targets the (unattainable) "best possible" inference described in the previous
paragraph, i.e., the one obtained by observing the true auxiliary variable. In particular,
uncertainty about $\theta$, after $X = x$ is observed, is propagated from our uncertainty about
hitting the true $U$ with our random set. There is a subtle difference between the ways
uncertainty is introduced and propagated in the IM and fiducial frameworks, but it turns
out the IM approach has a number of advantages. In addition to accomplishing Fisher's
goal of prior-free probabilistic inference, IMs produce inferential output which is valid for
any assertion/hypothesis of interest (Section 3.3); fiducial probabilities are valid only for
special kinds of assertions (Martin and Liu 2012, Sec. 4.3.1). Moreover, a general theory
of optimal IMs, concerning efficiency of the resulting inference, may not be out of reach.

3.2 Construction 



As described in Martin and Liu (2012), the construction of an IM proceeds in three
general steps: an association step, a prediction step, and a combination step. Next is a
summary of these three steps.

A-step. This first step proceeds by specifying an association between $X$, $\theta$, and $U$.
Like fiducial, Dempster-Shafer, and Fraser's structural models (Fraser 1968), this can
be described by a pair $(P_U, a)$, where $P_U$ describes the distribution (and also, implicitly,
the support) of the auxiliary variable $U$, and $a$ describes the data-generating mechanism
driven by $P_U$. That is, we write the association $(P_U, a)$ as

$$ X = a(\theta, U), \quad \text{with } U \sim P_U. $$

In other words, if $U$ is sampled from $P_U$ and then plugged in to the function $a$ for a given $\theta$, then
the resulting $X$ would have distribution $P_{X|\theta}$. This is familiar in the context of random
variable generation for simulations, etc., but less familiar in the context of inference. Note
that $a$ need not be described by a formal "equation"; it is enough to have a rule/recipe
to construct $X$ from a given $\theta$ and $U$; see, e.g., Section 4.3. Finally, construct the following
collection of subsets of $\Theta$ indexed by $(x, u)$:

$$ \Theta_x(u) = \{\theta : x = a(\theta, u)\}. \tag{2} $$

Throughout it will be assumed that $\Theta_x(u) \neq \varnothing$ for all $(x, u)$ pairs. This assumption boils
down to there being no non-trivial constraints on the parameter space $\Theta$. We will have
more to say about this assumption in the upcoming sections.

P-step. Based on the idea that knowing the unobserved value of $U$ is "as good as"
knowing $\theta$ itself, the goal of the prediction step is to predict this unobserved value with
a predictive random set, denoted by $\mathcal{S}$. Certain assumptions shall be required for the
support $\mathbb{S}$ and distribution $P_{\mathcal{S}}$ of the predictive random set; see Section 3.3.

C-step. Finally, the combination step combines the observed $X = x$, which specifies
a sub-collection of sets $\Theta_x(\cdot)$ in (2), with the predictive random set $\mathcal{S}$. The result is an
$x$-dependent random subset of $\Theta$:

$$ \Theta_x(\mathcal{S}) = \bigcup_{u \in \mathcal{S}} \Theta_x(u). \tag{3} $$

Now, evidence for/against an assertion $A \subseteq \Theta$ concerning the unknown parameter can
be obtained via the $P_{\mathcal{S}}$-probability that $\Theta_x(\mathcal{S})$ is/is not a subset of $A$. More precisely,
we may evaluate

$$ \mathrm{bel}_x(\cdot\,; \mathcal{S}) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \subseteq \cdot\}, \tag{4} $$

the belief function, at both $A$ and $A^c$, as a summary of evidence for and against $A$,
respectively. Alternatively, we can report $\mathrm{bel}_x(A; \mathcal{S})$ together with

$$ \mathrm{pl}_x(A; \mathcal{S}) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \cap A \neq \varnothing\} = 1 - \mathrm{bel}_x(A^c; \mathcal{S}), \tag{5} $$

the plausibility function at $A$. It can be readily shown that $\mathrm{bel}_x(A; \mathcal{S}) \le \mathrm{pl}_x(A; \mathcal{S})$ for
any $A$ and any $\mathcal{S}$. Then, as described briefly below, the pair $\{\mathrm{bel}_x(\cdot\,; \mathcal{S}), \mathrm{pl}_x(\cdot\,; \mathcal{S})\}$ is used
for inference about $\theta$; see Martin and Liu (2012) for details.



Statistical inference based on the IM output focuses on the relative magnitudes of
$\mathrm{bel}_x(A; \mathcal{S})$ and $\mathrm{pl}_x(A; \mathcal{S})$. That is, an assertion $A$ is deemed true (resp. untrue), given
$X = x$, if both $\mathrm{bel}_x(A; \mathcal{S})$ and $\mathrm{pl}_x(A; \mathcal{S})$ are large (resp. small). Conversely, if $\mathrm{bel}_x(A; \mathcal{S})$
is small and $\mathrm{pl}_x(A; \mathcal{S})$ is large, then there is no clear decision to be made about the
truthfulness of $A$, given $X = x$, so maybe one would collect more data. What counts as
"small" and "large" values of the belief/plausibility functions is specified by the theoretical
validity properties discussed in Section 3.3.

One can also construct frequentist procedures based on plausibility functions. For
example, for $\alpha \in (0, 1)$, a $100(1 - \alpha)\%$ plausibility region for $\theta$ is defined as

$$ \Pi_\alpha(x) = \{\theta : \mathrm{pl}_x(\theta; \mathcal{S}) > \alpha\}. \tag{6} $$

It is a consequence of Theorem 1 below that this region achieves the nominal $1 - \alpha$
frequentist coverage probability. Martin (2012) also considers a similar plausibility function-driven
construction of exact frequentist methods. The IM optimality theory can also be
used to investigate optimal confidence regions, but we will not discuss this here.
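
To make the plausibility region (6) concrete, here is a minimal sketch for a single observation $X \sim \mathsf{N}(\theta, 1)$ with association $X = \theta + \Phi^{-1}(U)$. The predictive random set is assumed, purely for illustration, to be the symmetric set $\{u : |u - 0.5| \le |U' - 0.5|\}$ with $U' \sim \mathsf{Unif}(0,1)$; under that assumption the singleton plausibility works out to $1 - |2\Phi(x - \theta) - 1|$. The function names, the grid, and the numbers are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def plausibility(theta, x):
    """pl_x({theta}) under the assumed symmetric predictive random set."""
    return 1.0 - abs(2.0 * STD_NORMAL.cdf(x - theta) - 1.0)

def plausibility_region(x, alpha, lo=-10.0, hi=10.0, grid_size=20001):
    """Grid approximation to {theta : pl_x(theta) > alpha} on [lo, hi].

    Returns the smallest and largest grid points lying in the region.
    """
    step = (hi - lo) / (grid_size - 1)
    inside = [lo + k * step for k in range(grid_size)
              if plausibility(lo + k * step, x) > alpha]
    return min(inside), max(inside)

# Usage (hypothetical numbers): for alpha = 0.05 the region should be
# close to the familiar interval x -/+ z_{0.975}.
x_obs, alpha = 1.3, 0.05
region = plausibility_region(x_obs, alpha)
z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
classical = (x_obs - z, x_obs + z)
```

Up to the grid resolution, the region coincides with the standard z-interval, which is one simple instance of the nominal coverage property mentioned above.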

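It is sometimes more convenient to evaluate the plausibility on the $u$-space as opposed
to the $\theta$-space as in (5). Given $x$ and $A$, let

$$ \mathbb{U}_x(A) = \{u : \Theta_x(u) \subseteq A\}. \tag{7} $$

If $\Theta_x(u) \neq \varnothing$ for all $(x, u)$, as we have assumed, then belief and plausibility can be
evaluated on the $u$-space as

$$ \begin{aligned} \mathrm{bel}_x(A; \mathcal{S}) &= P_{\mathcal{S}}\{\mathcal{S} \subseteq \mathbb{U}_x(A)\}, \\ \mathrm{pl}_x(A; \mathcal{S}) &= 1 - P_{\mathcal{S}}\{\mathcal{S} \subseteq \mathbb{U}_x(A^c)\}. \end{aligned} \tag{8} $$

This formulation will be used in the main result in Section 4.
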
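
The $u$-space formulas (7) and (8) lend themselves to direct simulation. The sketch below is a Monte Carlo illustration under assumptions that are ours, not the paper's: a single observation from $\mathsf{N}(\theta, 1)$ with $X = \theta + \Phi^{-1}(U)$, the assertion $A = (-\infty, \theta_0]$, and a symmetric interval-valued predictive random set. In that setting $\mathbb{U}_x(A) = [\Phi(x - \theta_0), 1)$ and $\mathbb{U}_x(A^c) = (0, \Phi(x - \theta_0))$.

```python
import random
from statistics import NormalDist

def belief_plausibility(x, theta0, draws=200_000, seed=1):
    """Monte Carlo estimate of (bel, pl) for A = (-inf, theta0].

    Model (assumed for illustration): X = theta + Phi^{-1}(U), U ~ Unif(0,1).
    Predictive random set (also assumed): S = [0.5 - R, 0.5 + R],
    with R = |U' - 0.5| and U' ~ Unif(0,1).
    """
    rng = random.Random(seed)
    cut = NormalDist().cdf(x - theta0)  # U_x(A) = [cut, 1), U_x(A^c) = (0, cut)
    in_a = 0    # draws with S contained in U_x(A)
    in_ac = 0   # draws with S contained in U_x(A^c)
    for _ in range(draws):
        r = abs(rng.random() - 0.5)
        if 0.5 - r >= cut:
            in_a += 1
        if 0.5 + r < cut:
            in_ac += 1
    bel = in_a / draws
    pl = 1.0 - in_ac / draws
    return bel, pl

# Usage with hypothetical numbers.
bel, pl = belief_plausibility(x=1.0, theta0=0.0)
```

Here the belief is zero while the plausibility is positive, illustrating the general inequality between the two quantities.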
3.3 IM validity

In addition to being posterior probabilistic in nature, which provides a sort of meaningfulness
within a given study, it is important that the belief and plausibility functions of
the IM have numerical values that are meaningful across studies. This second type of
meaningfulness is referred to as validity in Martin and Liu (2012). More specifically, an
IM is said to be valid at $A$ if, for any $\alpha \in (0, 1)$,

$$ \sup_{\theta \notin A} P_{X|\theta}\{\mathrm{bel}_X(A; \mathcal{S}) \ge 1 - \alpha\} \le \alpha. \tag{9} $$

Precisely, this condition says that $\mathrm{bel}_X(A; \mathcal{S})$ is stochastically no larger than $\mathsf{Unif}(0, 1)$
when $A$ is false. Intuitively, (9) specifies our expectations about the possible values of
$\mathrm{bel}_x(A; \mathcal{S})$ across different studies. That is, if $A$ is false, then $\mathrm{bel}_x(A; \mathcal{S})$ will be large for
only a relatively small proportion of possible $x$ values.

It will often be the case (see Theorem 1 below) that the IM is valid for any $A$. In
such cases, the validity criterion can be expressed in terms of the plausibility function,
which is of primary interest here. Specifically, under the conditions of Theorem 1 below,
we may conclude that, for any $\alpha \in (0, 1)$ and any $A \subseteq \Theta$,

$$ \sup_{\theta \in A} P_{X|\theta}\{\mathrm{pl}_X(A; \mathcal{S}) \le \alpha\} \le \alpha. \tag{10} $$

This means that, if $A$ is true, then $\mathrm{pl}_X(A; \mathcal{S})$ is stochastically no smaller than $\mathsf{Unif}(0, 1)$,
i.e., if $A$ is true, then $\mathrm{pl}_x(A; \mathcal{S})$ is small for only a small proportion of possible $x$ values,
the "outliers." It is worth emphasizing that (10) holds, without special modification, even for
the scientifically important case of singleton $A$. In fact, for reasonably chosen predictive
random sets (see Martin and Liu 2012, Corollary 1), the latter "$\le \alpha$" can be replaced
by "$= \alpha$"; hence $\mathrm{pl}_X(A; \mathcal{S}) \sim \mathsf{Unif}(0, 1)$ when $A = \{\theta_0\}$ is true. In Theorem 2 below we
show that Fisher's p-value is nothing but a plausibility function at the null hypothesis.
So (10) is a restatement of the familiar result that, if the null hypothesis is true, then the
p-value is stochastically no smaller than a $\mathsf{Unif}(0, 1)$ random variable.
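
The familiar p-value property just mentioned can be checked exactly in a small discrete case. The sketch below is an illustration under our own assumptions: a binomial model with null $\theta \le \theta_0$ and the usual upper-tail p-value $P_{\theta_0}\{X \ge x\}$; the sample size, $\theta_0$, the grid over the null, and the function names are hypothetical. It enumerates the sample space to compute $P_\theta\{\mathrm{pval}(X) \le \alpha\}$ and takes the largest value over the grid.

```python
from math import comb

def binom_pmf(x, n, theta):
    """P_theta{ X = x } for X ~ Bin(n, theta)."""
    return comb(n, x) * theta**x * (1.0 - theta)**(n - x)

def upper_tail(x, n, theta):
    """P_theta{ X >= x } for X ~ Bin(n, theta)."""
    return sum(binom_pmf(k, n, theta) for k in range(x, n + 1))

def prob_pval_at_most(alpha, n, theta0, theta):
    """Exact P_theta{ pval(X) <= alpha }, with pval(x) = P_theta0{ X >= x }."""
    return sum(binom_pmf(x, n, theta)
               for x in range(n + 1)
               if upper_tail(x, n, theta0) <= alpha)

# Usage: hypothetical n and theta0; theta ranges over a grid inside the null.
n, theta0 = 20, 0.3
null_grid = [theta0 * k / 50 for k in range(1, 51)]
worst = {a: max(prob_pval_at_most(a, n, theta0, th) for th in null_grid)
         for a in (0.01, 0.05, 0.10, 0.25)}
```

Each entry of `worst` should not exceed its level $\alpha$, in line with the bound in (10); because the binomial is discrete, the bound is typically not attained.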



How can we be sure that results like (9) or (10) hold? As shown by Martin and Liu
(2012), it turns out that these properties hinge on characteristics of the predictive random
set $\mathcal{S}$ alone; no requirements on the sampling model $P_{X|\theta}$ or the association $(P_U, a)$ are
needed. Indeed, they prove a version of the following theorem.

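Theorem 1. The IM is valid for all assertions $A \subseteq \Theta$, i.e., (10) holds for all $A$, if
$\Theta_x(u) \neq \varnothing$ for all $(x, u)$ and the predictive random set $\mathcal{S}$ satisfies:

P1. The support $\mathbb{S} \subseteq 2^{\mathbb{U}}$ of $\mathcal{S}$ contains $\varnothing$ and $\mathbb{U}$, and is:

(a) $P_U$-measurable, i.e., each $S \in \mathbb{S}$ is $P_U$-measurable, and

(b) nested, i.e., for any $S, S' \in \mathbb{S}$, either $S \subseteq S'$ or $S' \subseteq S$.

P2. The distribution $P_{\mathcal{S}}$ of $\mathcal{S}$ satisfies

$$ P_{\mathcal{S}}\{\mathcal{S} \subseteq K\} = \sup_{S \in \mathbb{S}:\, S \subseteq K} P_U(S), \quad K \subseteq \mathbb{U}. $$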



Martin and Liu (2012) show that a very wide variety of predictive random sets are
available for which P1-P2 hold, so that IM validity is rather easy to arrange. However,
analogous to the tradeoff between Type I and Type II errors in classical testing, one
must consider both validity and a competing notion of efficiency. For this, they lay out
a theory of IM optimality; see, also, Section 5.2 and Martin et al. (2012).

4 Fisher's p-value as an IM plausibility



4.1 Main result 

Recall the setup and notation in Sections 2-3. On the association $(a, P_U)$, the null hypothesis
$\Theta_0$, and the test statistic $T: \mathbb{X} \to \mathbb{T}$, we shall impose the following assumptions:

A1. For every $(x, u)$ there exists $\theta$ such that $T(x) = T(a(\theta, u))$.

A2. $\sup_{\theta \in \Theta_0} T(a(\theta, \cdot))$ is a $P_U$-measurable function.

A3. $P_U\{\sup_{\theta \in \Theta_0} T(a(\theta, U)) < t\} = \inf_{\theta \in \Theta_0} P_U\{T(a(\theta, U)) < t\}$ for all $t \in \mathbb{T}$.

Assumption A1 implies that the $\Theta_x(u)$ constructed in the proof below is non-empty for
all $(x, u)$. Assumption A2 ensures the meaningfulness of the probability statement in A3,
and holds generally under mild separability and continuity conditions, respectively, on $\Theta_0$
and on $T$ and $a$. Assumption A3 makes precise the stochastic smoothness and stochastic
ordering $T(X)$ should possess as a function of $\theta$. Assumptions A2-A3 hold trivially for
the important point-null case, i.e., $\Theta_0 = \{\theta_0\}$. It is also easy to check A3 in many common
examples, e.g., if $X_1, \ldots, X_n$ are iid $\mathsf{N}(\theta, 1)$, and $T(X) = \bar X$, then $T(a(\theta, U)) = \theta + U$,
and A3 holds for any $\Theta_0$ of the form $(-\infty, \theta_0]$.
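
For the normal example just mentioned, the check of A3 can be written out in a few lines. The display below is a sketch under the assumption that the auxiliary variable is the mean of the standard normal errors, so that $U \sim \mathsf{N}(0, 1/n)$ and $\Phi$ denotes the standard normal distribution function.

```latex
% Assumption for this sketch: U ~ N(0, 1/n), so that \bar X = \theta + U.
\begin{align*}
\sup_{\theta \le \theta_0} T(a(\theta, U))
  &= \sup_{\theta \le \theta_0} (\theta + U) = \theta_0 + U, \\
P_U\Bigl\{ \sup_{\theta \le \theta_0} T(a(\theta, U)) < t \Bigr\}
  &= P_U\{ U < t - \theta_0 \}
   = \Phi\bigl( \sqrt{n}\,(t - \theta_0) \bigr), \\
\inf_{\theta \le \theta_0} P_U\{ T(a(\theta, U)) < t \}
  &= \inf_{\theta \le \theta_0} \Phi\bigl( \sqrt{n}\,(t - \theta) \bigr)
   = \Phi\bigl( \sqrt{n}\,(t - \theta_0) \bigr).
\end{align*}
```

The last infimum is attained at $\theta = \theta_0$ because $\Phi(\sqrt{n}\,(t - \theta))$ is decreasing in $\theta$, so the two sides of A3 agree for every $t$.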

The main result of this paper is the following theorem, showing that the p-value is
exactly the plausibility of $H_0$ under a suitably constructed (and valid) IM.

Theorem 2. Assume A1-A3 hold for the given association $(a, P_U)$, hypothesis $\Theta_0$,
and test statistic $T: \mathbb{X} \to \mathbb{T}$. Then there exists an admissible predictive random set $\mathcal{S}$,
depending on $T$ and $\Theta_0$, such that the plausibility function $\mathrm{pl}_x(\Theta_0; \mathcal{S})$ equals
$\mathrm{pval}(x) = \mathrm{pval}_{T,\Theta_0}(x)$ in (1) for all $x \in \mathbb{X}$.

Proof. Without loss of generality, we may reduce the baseline association $X = a(\theta, U)$,
with $U \sim P_U$, to $T(X) = T(a(\theta, U))$, again with $U \sim P_U$. In this case, the A-step of the
IM construction generates the following collection of subsets of $\Theta$:

$$ \Theta_x(u) = \{\theta : T(x) = T(a(\theta, u))\}, \quad x \in \mathbb{X}, \; u \in \mathbb{U}. $$

These sets are non-empty for all $(x, u)$ by A1. For the P-step, we define a collection
$\mathbb{S} = \{S_t : t \in \mathbb{T}\}$ of subsets of $\mathbb{U}$ as follows:

$$ S_t = \{u : \sup_{\theta \in \Theta_0} T(a(\theta, u)) < t\}, \quad t \in \mathbb{T}. $$

The collection $\mathbb{S}$ is clearly nested, and $P_U$-measurability follows from A2. Thus P1 in
Theorem 1 holds. Now define a predictive random set $\mathcal{S}$, supported on $\mathbb{S}$, to have
"distribution function" as in P2, i.e., for any $K \subseteq \mathbb{U}$,

$$ P_{\mathcal{S}}\{\mathcal{S} \subseteq K\} = P_U(S_{t^*_K}) = \inf_{\theta \in \Theta_0} P_{X|\theta}\{T(X) < t^*_K\}, \tag{11} $$

where $t^*_K = \sup\{t \in \mathbb{T} : S_t \subseteq K\}$; the last equality in (11) is a consequence of A3.
For $\mathcal{S}$ so defined, the corresponding IM will be valid. For notational consistency, set
$A = \Theta_0$. The C-step proceeds as in the general case in Section 3.2 and, for any $x \in \mathbb{X}$,
the resulting plausibility function (8), evaluated at $A$, satisfies

$$ \mathrm{pl}_x(A; \mathcal{S}) = 1 - P_{\mathcal{S}}\{\mathcal{S} \subseteq \mathbb{U}_x(A^c)\} = 1 - P_{\mathcal{S}}\{\mathcal{S} \subseteq S_{T(x)}\}. \tag{12} $$

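Some elaboration on the second equality in (12) is in order. The sequence of implications
displayed below shows, first, that $S_{T(x)} \subseteq \mathbb{U}_x(A^c)$:

$$ \begin{aligned} u \in S_{T(x)} &\implies T(x) > \sup_{\theta \in A} T(a(\theta, u)) \\ &\implies T(x) = T(a(\theta, u)) \text{ only if } \theta \notin A \\ &\implies \Theta_x(u) \subseteq A^c \\ &\implies u \in \mathbb{U}_x(A^c). \end{aligned} $$

It remains to show that $S_{T(x)}$ is the largest of the $S_t$'s contained in $\mathbb{U}_x(A^c)$. Indeed, for
any small $\varepsilon > 0$, there exists $u \in S_{T(x)+\varepsilon}$ such that $T(x) \le \sup_{\theta \in A} T(a(\theta, u))$; for this $u$,
$\Theta_x(u) \not\subseteq A^c$, so $u \notin \mathbb{U}_x(A^c)$. Thus, we have verified (12). Therefore,

$$ \begin{aligned} \mathrm{pl}_x(A; \mathcal{S}) &= 1 - P_{\mathcal{S}}\{\mathcal{S} \subseteq S_{T(x)}\} \\ &= 1 - \inf_{\theta \in \Theta_0} P_{X|\theta}\{T(X) < T(x)\} \quad \text{[by (11)]} \\ &= \sup_{\theta \in \Theta_0} P_{X|\theta}\{T(X) \ge T(x)\}. \end{aligned} $$

The right-hand side is $\mathrm{pval}(x)$ in (1), completing the proof. □

Corollary 1. Under A1, if $\Theta_0 = \{\theta_0\}$, then the conclusion of Theorem 2 holds.

Proof. Conditions A2-A3 hold automatically for singleton $\Theta_0$ and suitable $T$. □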
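
The construction in the proof can be mimicked numerically. The following Monte Carlo sketch is an illustration under assumptions of our own choosing: the normal-mean setting with $T(a(\theta, U)) = \theta + U$, $U \sim \mathsf{N}(0, 1/n)$, and $\Theta_0 = (-\infty, \theta_0]$, so that $S_t = \{u : \theta_0 + u < t\}$. It estimates $1 - P_U(S_{T(x)})$ by simulation and sets it beside the closed-form one-sided p-value; the function names and numbers are hypothetical.

```python
import math
import random

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def plausibility_of_null(xbar, n, theta0, draws=200_000, seed=7):
    """Monte Carlo version of pl_x(Theta_0; S) = 1 - P_U(S_{T(x)}).

    Setting (assumed for illustration): T(a(theta, U)) = theta + U with
    U ~ N(0, 1/n) and Theta_0 = (-inf, theta0], so that
    S_t = {u : theta0 + u < t}.
    """
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    hits = sum(1 for _ in range(draws)
               if theta0 + rng.gauss(0.0, sd) < xbar)  # u falls in S_{T(x)}
    return 1.0 - hits / draws

# Usage with a hypothetical data summary.
xbar, n, theta0 = 0.4, 25, 0.0
pl = plausibility_of_null(xbar, n, theta0)
pval = 1.0 - normal_cdf(math.sqrt(n) * (xbar - theta0))
```

Up to simulation error, `pl` and `pval` agree, which is what Theorem 2 asserts in this special case.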

4.2 Remarks 

Theorem 2 and its corollary demonstrate that, under mild conditions, Fisher's p-value
can be interpreted as a plausibility of $H_0$ for a certain well-behaved IM. The advantage
of such a correspondence is that a plausibility has a very natural interpretation. In
particular, it is a direct measure of the evidence in data $X = x$ that $H_0$ is true, namely an
upper bound on the belief probability of $H_0$. So, if $\mathrm{pval}(x)$ is small, then evidence in
favor of $H_0$ is slight, suggesting $H_1$ is a better model. Conversely, if $\mathrm{pval}(x)$ is not
small, then there may be non-negligible support in $x$ for $H_0$, so $H_0$ is a plausible model.
This is consistent with the standard use of the p-value. However, on careful inspection,
one sees that the calculation of $\mathrm{pl}_x(\Theta_0; \mathcal{S})$, a numerical equivalent to $\mathrm{pval}(x)$, does not
require an assumption that $H_0$ is true. This is particularly important, for it is this subtle
conditioning on the truthfulness of $H_0$ that makes the standard interpretation of p-values so
difficult, in turn causing misinterpretation. But interpretation of the p-value as a plausibility
avoids this difficulty.



Dempster (2008, p. 375) points out a similar connection between plausibility and p-value.
Specifically, he shows numerically how Fisher's p-value can be decomposed into
two pieces, one corresponding to belief in $H_0$ and the other corresponding to "don't
know," the sum of which is our plausibility. But his example is for the standard test
for a Poisson mean based on a one-sided alternative hypothesis, and he stops short of
claiming such a correspondence for general models, hypotheses, etc. So Theorem 2 above
goes beyond what Dempster had observed.

It is also interesting to note that, in the Bayesian setting, a search for "objective"
priors often focuses on probability matching (e.g., Ghosh 2011, and references therein).
In particular, the goal is to choose the prior such that the corresponding posterior tail
probabilities and p-values are (asymptotically) numerically equivalent. But given the
connection between p-values and IM plausibilities in Theorem 2, these objective Bayes
posterior probabilities can also be interpreted as plausibilities. This is perhaps not surprising
given that (objective) Bayes posterior distributions can be viewed as a simple and
attractive way to approximate frequentist p-values (Fraser 2011).

This connection between p-values and plausibilities also casts light on the argument
in Schervish (1996) concerning the use of p-values as measures of evidence. He shows
that, in general, p-values fail to satisfy a certain coherence property, i.e., that, for a given
$x$, if $\Theta_0' \subseteq \Theta_0$, then the p-value for $\Theta_0'$ should be no more than the p-value for $\Theta_0$.
Theorem 2 explains this somewhat unexpected behavior rather easily: p-values for different
hypotheses are plausibilities with respect to two, possibly different IMs, and since the
scales of these plausibilities are not necessarily the same, comparison across hypotheses
may not be coherent. The same would be true for Bayesian posterior probabilities for $\Theta_0'$
and $\Theta_0$ if different priors are used for each testing problem, which would not necessarily
be out-of-the-ordinary.

In the case that $\Theta_x(u) = \varnothing$ for some pair $(x, u)$, the derivations above break down,
i.e., constructing an IM with plausibility function matching the p-value cannot be done
as described in the proof of Theorem 2. Such a situation arises in, e.g., a normal mean
problem $\mathsf{N}(\theta, 1)$ with $\Theta = [-1, 1]$, say. If $X = -1$ is observed, then
$\Theta_{-1}(u) = \{\theta \in [-1, 1] : -1 = \theta + \Phi^{-1}(u)\}$ is empty for $u > 1/2$. For such problems, Ermini Leaf and Liu
(2012) present a modification of the IM approach which stretches the predictive random
set just enough so that $\Theta_x(\mathcal{S})$ is not empty while maintaining validity. The result of this
stretching is, in general, an increase in the plausibility function. That is, a constraint on
the parameter space will tend to increase the plausibility for a given assertion/hypothesis.
The p-value depends only on the null hypothesis, so it will not be affected by parameter
constraints. Our position is that this is a shortcoming of the p-value, not of the IM
approach. That is, evidence for a particular assertion should automatically become larger
when the range of possible alternatives shrinks. Consider the extreme example where the
null hypothesis is exactly the constrained parameter space. In that case, all evidence
should point to the truthfulness of this hypothesis; the (adjusted) IM approach handles
this easily, while the p-value misses the mark completely.
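
The constrained normal-mean example above can be checked directly. The small sketch below, with a hypothetical function name, returns the set $\Theta_x(u)$ for the association $X = \theta + \Phi^{-1}(U)$ when $\theta$ is restricted to $[-1, 1]$; it is meant only to show where the emptiness comes from, not to implement the stretching modification.

```python
from statistics import NormalDist

def theta_set(x, u, lower=-1.0, upper=1.0):
    """Theta_x(u) for X = theta + Phi^{-1}(u), with theta restricted to [lower, upper].

    Returns the single solution theta = x - Phi^{-1}(u) in a list, or an
    empty list when that solution violates the constraint.
    """
    theta = x - NormalDist().inv_cdf(u)
    return [theta] if lower <= theta <= upper else []

# Usage: at x = -1, a value of u below 1/2 gives a solution inside [-1, 1],
# while a value of u above 1/2 gives an empty set.
nonempty = theta_set(-1.0, 0.3)
empty = theta_set(-1.0, 0.7)
```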

4.3 Binomial example 

Consider a binomial model, $X \sim \mathsf{Bin}(n, \theta)$, where $n$ is a known positive integer and
$\theta \in (0, 1)$ is the unknown success probability. In this case, the natural association is

$$ F_\theta(X - 1) \le U < F_\theta(X), \quad U \sim \mathsf{Unif}(0, 1), $$

where $F_\theta$ is the $\mathsf{Bin}(n, \theta)$ distribution function. Notice that there is no simple equation
linking $(X, \theta, U)$ in this case, just a rule "$X = a(\theta, U)$" for producing $X$ with given $\theta$
and $U$. In this section we shall construct the p-value-based IM developed in the proof of
Theorem 2 for a one-sided assertion/hypothesis.
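
The rule "$X = a(\theta, U)$" described above is the usual inverse-distribution-function recipe, and a minimal sketch may help fix ideas. The function names and the numbers in the usage lines are illustrative assumptions.

```python
from math import comb

def binom_cdf(x, n, theta):
    """F_theta(x) = P{X <= x} for X ~ Bin(n, theta); zero for x < 0."""
    if x < 0:
        return 0.0
    return sum(comb(n, k) * theta**k * (1.0 - theta)**(n - k)
               for k in range(min(x, n) + 1))

def a(theta, u, n):
    """The rule X = a(theta, U): the x with F_theta(x-1) <= u < F_theta(x)."""
    for x in range(n + 1):
        if u < binom_cdf(x, n, theta):
            return x
    return n  # guards against rounding when u is extremely close to 1

# Usage: hypothetical n, theta, and u.
n, theta, u = 10, 0.4, 0.62
x = a(theta, u, n)
```

For fixed $u$, the value returned by `a` does not decrease as $\theta$ increases, which is the monotonicity that makes the one-sided problem tractable.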

First, consider a one-sided assertion $A = (0, \theta_0]$, for some fixed $\theta_0 \in (0, 1)$. If the null
hypothesis is $H_0: \theta \in A$, then the uniformly most powerful test rejects $H_0$ in favor of
$H_1: \theta \in A^c$ if and only if $T(X) = X$ is too large. With this choice of $T$, for the A-step,
we may write

$$ \Theta_x(u) = \{\theta : T(x) = T(a(\theta, u))\} = \{\theta : F_\theta(x - 1) \le u < F_\theta(x)\}. $$

If $G_{a,b}$ denotes the $\mathsf{Beta}(a, b)$ distribution function, then we may rewrite $\Theta_x(u)$ as

$$ \begin{aligned} \Theta_x(u) &= \{\theta : G_{n-x+1,x}(1 - \theta) \le u < G_{n-x,x+1}(1 - \theta)\} \\ &= \{\theta : 1 - G^{-1}_{n-x+1,x}(u) \le \theta < 1 - G^{-1}_{n-x,x+1}(u)\}, \end{aligned} $$

which is clearly a sub-interval of $(0, 1)$. For the P-step, we construct the support
$\mathbb{S} = \{S_t : t \in \mathbb{T}\}$, where, in this case, $\mathbb{T} = \mathbb{X} = \{0, 1, \ldots, n\}$. Since $a(\theta, u)$ is
non-decreasing in $\theta$ for fixed $u$, the supremum over $A$ is attained at $\theta_0$, and it is easy to see that

$$ S_t = \{u : \sup_{\theta \in A} T(a(\theta, u)) < t\} = (0, F_{\theta_0}(t - 1)). $$

When equipped with the natural measure, determined by $P_U = \mathsf{Unif}(0, 1)$, the C-step
produces a plausibility function for $A = (0, \theta_0]$, at the observed $X = x$, given by

$$ \mathrm{pl}_x(A; \mathcal{S}) = 1 - F_{\theta_0}(x - 1) = P_{X|\theta_0}\{X \ge x\}, $$

which is exactly the standard p-value for the one-sided test in a binomial problem.



5 Improved p-values via optimal IMs 



5.1 Main idea 

In the previous section we demonstrated that, given any p-value, one can construct an IM
such that the plausibility function evaluated at the assertion of interest is exactly that
p-value. This suggests that p-values are nothing but IM plausibilities, and this connection
provides some nice intuition and may help ease interpretation of p-values. However,
this connection can be reversed: IMs can be used to develop "new and improved" p-values.
That IM plausibilities offer a natural way to rectify issues resulting from
non-trivial parameter constraints is another reason for preferring these over ordinary
p-values. Also, as we illustrate below, the theory of optimal IMs in Martin and Liu (2012)
and Martin et al. (2012) comes in handy here.



5.2 Optimal IMs and p-values 

The presentation here on optimal IMs will be kept brief; a more detailed description of
the ideas and computational methods can be found in Martin et al. (2012). Here we
focus on the binomial case presented in Section 4.3. This is a fundamental problem in
statistics (Pearson 1920) and, despite its simple setup, Brown et al. (2001) show that the
problem is far from trivial. The optimal IM analysis presented below can be viewed as a
supplement to the latter paper.

Consider a "two-sided" assertion $A = \{\theta_0\}^c$, where $\theta_0$ is some fixed value in $\Theta = [0, 1]$.
This is an important kind of assertion, for it appears, albeit indirectly, in the evaluation of
the plausibility region (6). This is also a challenging problem, for it parallels the
frequentist search for a most powerful test of a point null hypothesis (e.g., Lehmann and Romano
2005, Sec. 3.7). The goal here is to construct an "optimal" predictive random set. To
be precise, we shall focus on the class of predictive random sets $\mathcal{S}$ that satisfy P1-P2 in
Theorem 1. For such an $\mathcal{S}$, we know by Theorem 1 that $\mathrm{bel}_X(\{\theta_0\}^c; \mathcal{S})$ is stochastically no
larger than $\mathrm{Unif}(0, 1)$ when $X \sim P_{X|\theta_0}$. The goal is to choose $\mathcal{S}$ such that $\mathrm{bel}_X(\{\theta_0\}^c; \mathcal{S})$
is (stochastically) as large as possible under $X \sim \mathrm{Bin}(n, \theta)$, $\theta \ne \theta_0$.
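Let $\mathbb{X} = \{0, 1, \ldots, n\}$ and recall the basic association

\[ \Theta_x(u) = [1 - G^{-1}_{n-x+1,x}(u),\; 1 - G^{-1}_{n-x,x+1}(u)), \quad x \in \mathbb{X}, \; u \in (0, 1), \]

from Section 4.3 above. In this case, we have

\[ \mathbb{U}_x(\{\theta_0\}^c) = [G_{n-x+1,x}(1-\theta_0),\; G_{n-x,x+1}(1-\theta_0))^c, \quad x \in \mathbb{X}. \]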



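Following Martin et al. (2012), we shall consider predictive random sets $\mathcal{S} = \mathcal{S}_\rho$
determined by a particular ranking $\rho$ of the $x$ values. Write $\mathbb{T} = \mathbb{X}$ and define the support
$\mathbb{S}_\rho = \{S_{\rho,t} : t \in \mathbb{T}\}$ as follows: $S_{\rho,0} = \varnothing$ and, for $t = 1, \ldots, n$,

\[
S_{\rho,t} = \bigcap_{x : \rho(x) > t} \mathbb{U}_x(\{\theta_0\}^c)
           = \bigcup_{x : \rho(x) \le t} [G_{n-x+1,x}(1-\theta_0),\; G_{n-x,x+1}(1-\theta_0)).
\]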



The support $\mathbb{S}_\rho$ is clearly nested. If we take the distribution of the corresponding
predictive random set $\mathcal{S}_\rho$ to satisfy P2, then the assumptions of Theorem 1 are satisfied. Note
that $S_{\rho,t}$, with $t = \rho(x) - 1$, is the largest of the $S_{\rho,t}$'s contained in $\mathbb{U}_x(\{\theta_0\}^c)$. Therefore,
the belief function at $A = \{\theta_0\}^c$ based on the predictive random set $\mathcal{S}_\rho$ is given by



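\[
\begin{aligned}
\mathrm{bel}_x(\{\theta_0\}^c; \mathcal{S}_\rho) &= P_U\{S_{\rho,\rho(x)-1}\} \\
  &= \sum_{t : \rho(t) < \rho(x)} \bigl[ G_{n-t,t+1}(1-\theta_0) - G_{n-t+1,t}(1-\theta_0) \bigr] \\
  &= \sum_{t : \rho(t) < \rho(x)} f_{\theta_0}(t),
\end{aligned}
\]

where $f_\theta$ is the binomial probability mass function. Then the goal is to choose the
ranking $\rho$ such that $\mathrm{bel}_X(\{\theta_0\}^c; \mathcal{S}_\rho)$ is stochastically large when $X \sim \mathrm{Bin}(n, \theta)$, $\theta \ne \theta_0$.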

[Figure 1: panels (c) x = 4 and (d) x = 5; horizontal axis from 0.0 to 1.0.]

Figure 1: "Optimal" plausibility function (black) and the p-value function (gray) for the
binomial problem with n = 20 and various x values.

For inference on a Poisson mean, Martin et al. (2012) propose a recursive scheme
for choosing such an optimal ranking $\rho = \rho^*$. This scheme carries over exactly to the
binomial case here so, for brevity, we refer the reader to the aforementioned paper for the
details. Here we just show a few results for illustration. Figure 1 shows the plausibility
function

\[ \mathrm{pl}_x(\theta_0; \mathcal{S}_{\rho^*}) = 1 - \sum_{t : \rho^*(t) < \rho^*(x)} f_{\theta_0}(t) \]

as a function of $\theta_0$ for several values of $x$ with $n = 20$. For comparison, we also show the
p-value function for the exact binomial test of $H_0: \theta = \theta_0$. The horizontal line at $\alpha = 0.1$
determines 90% plausibility/confidence intervals for $\theta$. In this case, we can see that when
$x$ is close to 0, the plausibility intervals are slightly shorter, while the opposite is true
when $x$ is close to $n/2 = 10$. A similar relationship holds when $x/n$ is greater than 0.5.
Similar relationships hold for other values of $n$.
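To make the ranking-based formula concrete, here is a minimal sketch of how a plausibility function and a grid approximation to the corresponding plausibility region could be computed once a ranking is supplied. It does not implement the optimal ranking $\rho^*$, whose recursive construction is given in Martin et al. (2012) and is not reproduced here. As a stand-in, it assumes a hypothetical ranking that orders outcomes by decreasing null probability, with ties sharing a rank; all function names are illustrative assumptions.

```python
from math import comb


def binom_pmf(t, n, theta):
    """f_theta(t) for the Bin(n, theta) model."""
    return comb(n, t) * theta**t * (1 - theta) ** (n - t)


def null_probability_ranking(n, theta0):
    """Hypothetical ranking (NOT the optimal rho*): rho(x) is 1 plus the number
    of outcomes strictly more probable than x under theta0; ties share a rank."""
    f = [binom_pmf(t, n, theta0) for t in range(n + 1)]
    return [1 + sum(ft > fx for ft in f) for fx in f]


def plausibility(x, n, theta0, ranking=null_probability_ranking):
    """pl_x(theta0) = 1 - sum of f_theta0(t) over all t with rho(t) < rho(x)."""
    rho = ranking(n, theta0)
    bel = sum(binom_pmf(t, n, theta0) for t in range(n + 1) if rho[t] < rho[x])
    return 1.0 - bel


def plausibility_region(x, n, alpha=0.1, grid_size=1000):
    """Grid approximation to {theta0 : pl_x(theta0) > alpha}.
    Returns the smallest and largest grid points retained."""
    grid = [i / grid_size for i in range(1, grid_size)]
    kept = [th for th in grid if plausibility(x, n, th) > alpha]
    return min(kept), max(kept)


# Illustrative call with n = 20 and x = 5.
print(plausibility(5, 20, 0.5))
print(plausibility_region(5, 20, alpha=0.1))
```

Any other ranking, including an implementation of $\rho^*$, can be passed through the `ranking` argument as a function returning one rank per outcome. Because the retained set need not be an interval for every ranking, `plausibility_region` reports only its extreme grid points.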
6 Discussion



In this paper we have developed a new user-friendly interpretation of the familiar but
often misinterpreted p-value. Specifically, using the language of inferential models (IMs),
we have shown that, for essentially any hypothesis testing problem, under mild conditions,
there exists a valid IM such that its plausibility function, evaluated at the null hypothesis,
is exactly the usual p-value.

The advantages of this new interpretation are two-fold. First, the name "plausibility"
is consistent with the way practitioners use and understand p-values. Indeed, a small
p-value means the hypothesis is implausible, so the natural choice is to reject $H_0$; conversely,
a large p-value means the hypothesis is plausible (though not necessarily true) and so
the natural choice is to not reject $H_0$. Second, there is the implicit but important fact
that the IM plausibility does not require an assumption that $H_0$ is true. Arguably, it
is this almost-hidden conditioning on the truthfulness of $H_0$ that is responsible for the
widespread misinterpretation of p-values. Although the IM construction depends on $H_0$,
the underlying probability calculation does not depend on $H_0$ being true, so this logical
difficulty is altogether avoided. Moreover, the representation of p-values in terms of IM
plausibilities casts light on a potential shortcoming of p-values that can arise in problems
with non-trivial parameter constraints. In such cases, it is not clear how to modify the
p-value, while modifications of the IM plausibility are readily obtained via the methods
described in Ermini Leaf and Liu (2012).



There are numerous alternatives to the p-value in the hypothesis testing literature, and
these are popular, at least in part, because of the difficulties in interpreting p-values. For
example, Jim Berger (and co-authors) have recommended converting p-values to Bayes
factors, or posterior odds, for interpretation; for example, Sellke et al. (2001) make a
strong case for their suggested "$-e\,p \log p$" adjustment. However, it is unlikely that
p-values will ever disappear from textbooks and applied work, so compared to offering an
alternative to the familiar p-value, it may be more valuable to offer a more user-friendly
interpretation. To borrow an analogy Larry Wasserman used on his blog,¹ many people
are poor drivers, but the answer to this problem is not to ban cars.

Aside from the main goal of re-interpreting Fisher's p-value, this connection between
plausibility and p-value also casts light on the nature of the IM output. IM belief and
plausibility functions are understood in Martin and Liu (2012) as measures of evidence
given data. But seeing that, in some cases, plausibility and p-value match up provides
some helpful additional information. Indeed, it suggests that one should reason with IM
plausibilities in the same way one reasons with p-values. At a somewhat higher level,
the correspondence between plausibilities, p-values, and some objective Bayes posterior
probabilities, and the fact that IMs contain the fiducial/Dempster-Shafer paradigms as
special cases, suggests that the IM framework may in fact provide a unified perspective
on robust, objective, probabilistic inference.



1 http://normaldeviate.wordpress.com/2012/07/11/ - cf. "p-value police"